Multiple Regression Assumptions and Diagnostics

STA6235: Modeling in Regression

Introduction

  • Let us now review the “checks” we will perform on our models.

    • Model assumptions

    • Outliers

    • Influence

    • Leverage

    • Multicollinearity

Model Assumptions

  • Recall the general linear model, y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon, where \varepsilon is the random error term.

  • Recall that the residual, our estimate of the error, is defined as e = y - \hat{y} where

    • y is the observed value

    • \hat{y} is the predicted value

  • The residual tells us how far away the observation is from the response surface (the predicted value).

  • Ordinary least squares (OLS) estimation chooses the \hat{\beta}_j that minimize the overall error, i.e., the sum of squared residuals, \sum e_i^2.
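We can see this minimization property directly in R. The sketch below uses the built-in mtcars data (not today's dataset) purely for illustration: nudging the OLS slope in either direction can only increase the sum of squared residuals.

```r
# Illustration with built-in mtcars data (not today's dataset):
# the OLS fit minimizes the sum of squared residuals.
m <- lm(mpg ~ wt, data = mtcars)
sse_ols <- sum(resid(m)^2)

# Perturb the fitted slope slightly -- the SSE can only get worse
b <- coef(m)
sse_perturbed <- sum((mtcars$mpg - (b[1] + (b[2] + 0.1) * mtcars$wt))^2)
sse_ols < sse_perturbed  # TRUE
```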

Model Assumptions

  • Recall the regression (and ANOVA) assumptions,

\varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2)

  • The assumptions are on the residual!

  • What this means:

    • Residuals are normally distributed

    • Distribution of residuals is centered at 0

    • Variance of residuals is some constant \sigma^2

Model Assumptions

  • We will assess the assumptions graphically.

    • Constant variance: scatterplot

    • Normal distribution: q-q plot

  • A package was written by a former student, classpackage.

    • If you are working on the server, the package is already installed.

    • If you are not working on the server, you need to install the package:

# install.packages("devtools") - use this if R tells you it can't find install_github()
library(devtools)
install_github("ieb2/class_package", force = TRUE)

Model Assumptions

  • Once installed, we call the package,
library(classpackage)
  • While there are several functions in this package, we are interested in the anova_check() function.
m <- lm(y ~ x1 + x2 + ..., data = dataset)
anova_check(m)
  • This will provide the scatterplot and the q-q plot.
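If you cannot install classpackage, the same two diagnostic plots can be produced with base R. This is a sketch using an example model fit to the built-in mtcars data; substitute your own fitted lm object for m.

```r
# Base-R equivalents of the two assumption-checking plots
# (m is an example model on mtcars; substitute your own lm object)
m <- lm(mpg ~ wt, data = mtcars)

# Constant variance: residuals vs. fitted values
plot(fitted(m), resid(m),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: q-q plot of the residuals
qqnorm(resid(m))
qqline(resid(m))
```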

Today’s Data

library(tidyverse)
office_ratings <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-17/office_ratings.csv')
head(office_ratings, n=4)

Model Assumptions

  • Further, recall the model we constructed,
m1 <- lm(imdb_rating ~ season + episode, data = office_ratings)
summary(m1)

Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.43672 -0.29274 -0.03135  0.26734  1.62060 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.605835   0.102452  83.999  < 2e-16 ***
season      -0.093289   0.015109  -6.174 4.11e-09 ***
episode      0.013616   0.005132   2.653  0.00868 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4888 on 185 degrees of freedom
Multiple R-squared:  0.1835,    Adjusted R-squared:  0.1747 
F-statistic: 20.79 on 2 and 185 DF,  p-value: 7.147e-09

Model Assumptions

  • Now, let’s check the assumptions on the model,
anova_check(m1)

Model Fit: R^2

  • If we want to know how well the model fits the particular dataset at hand, we can look at the R^2.

    • R^2 is the proportion of variance explained by the model.

    • Because it is a proportion, R^2 \in [0, 1] and is independent of the units of y.

  • If R^2 is near 0, the model does not fit the data well; if R^2 is near 1, the model fits the data well.

    • Note that if R^2=1, all observations fall on the response surface.

R^2 = \frac{\text{SSReg}}{\text{SSTot}}
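The decomposition behind this formula can be verified by hand. A sketch with the built-in mtcars data (not today's dataset): compute SSTot and SSE directly, take SSReg as their difference, and compare the resulting ratio to what summary() reports.

```r
# Verify R^2 = SSReg / SSTot by hand (mtcars used for illustration)
m <- lm(mpg ~ wt + hp, data = mtcars)

ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variability in y
ss_e   <- sum(resid(m)^2)                         # unexplained variability
ss_reg <- ss_tot - ss_e                           # explained variability

r2 <- ss_reg / ss_tot
all.equal(r2, summary(m)$r.squared)  # TRUE
```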

Model Fit: R^2

  • Remember that we are partitioning the variability in y (SSTot), which is constant, into two pieces:

    • The variability explained by the regression model (SSReg).

    • The variability explained by outside sources (SSE).

  • As predictors are added to the model, SSReg necessarily increases (and SSE decreases), even when the added predictors are not useful.

  • We want a measure of model fit that is resistant to this fluctuation, R^2_{\text{adj}} = 1 - \left( \frac{n-1}{n-k-1} \right) \left( 1 - R^2 \right), where k is the number of predictor terms in the model.
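The adjusted R^2 formula can likewise be checked against R's output. A sketch with mtcars (for illustration only), where k = 2 predictor terms:

```r
# Verify the adjusted R^2 formula by hand (mtcars used for illustration)
m  <- lm(mpg ~ wt + hp, data = mtcars)
n  <- nrow(mtcars)   # number of observations
k  <- 2              # number of predictor terms in the model
r2 <- summary(m)$r.squared

r2_adj <- 1 - ((n - 1) / (n - k - 1)) * (1 - r2)
all.equal(r2_adj, summary(m)$adj.r.squared)  # TRUE
```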

Model Fit: R^2

  • In our example,
summary(m1)

Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.43672 -0.29274 -0.03135  0.26734  1.62060 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.605835   0.102452  83.999  < 2e-16 ***
season      -0.093289   0.015109  -6.174 4.11e-09 ***
episode      0.013616   0.005132   2.653  0.00868 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4888 on 185 degrees of freedom
Multiple R-squared:  0.1835,    Adjusted R-squared:  0.1747 
F-statistic: 20.79 on 2 and 185 DF,  p-value: 7.147e-09

Model Diagnostics: Outliers

  • Definition: data values that are much larger or smaller than the rest of the values in the dataset.

  • We will look at the standardized residuals, e_{i_{\text{standardized}}} = \frac{e_i}{\sqrt{\text{MSE}(1-h_i)}}, where

    • e_i = y_i - \hat{y}_i is the residual of the ith observation
    • h_i is the leverage of the ith observation
  • If |e_{i_{\text{standardized}}}| > 2.5 \ \to \ outlier.

  • If |e_{i_{\text{standardized}}}| > 3 \ \to \ extreme outlier.
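The standardization formula above matches what rstandard() computes. A sketch with mtcars (for illustration): build the standardized residuals from the MSE and the leverages, then compare to rstandard().

```r
# Standardized residuals by hand vs. rstandard() (mtcars for illustration)
m   <- lm(mpg ~ wt, data = mtcars)
mse <- sum(resid(m)^2) / df.residual(m)  # MSE = SSE / (n - k - 1)
h   <- hatvalues(m)                      # leverages h_i

e_std <- resid(m) / sqrt(mse * (1 - h))
all.equal(unname(e_std), unname(rstandard(m)))  # TRUE
```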

Model Diagnostics: Outliers

  • We will use the rstandard() function to find the standardized residuals.

  • For ease of examining in large datasets, we will use it to create a “flag.”

dataset <- dataset %>%
  mutate(outlier = abs(rstandard(m))>2.5)
  • We can count the number of outliers,
dataset %>% count(outlier)
  • We can just look at outliers from the dataset,
new_dataset <- dataset %>% 
  filter(outlier == TRUE)
  • We can also exclude outliers from the dataset,
new_dataset <- dataset %>% 
  filter(outlier == FALSE)

Model Diagnostics: Outliers

  • Let’s check for outliers in our example data,
office_ratings <- office_ratings %>% 
  mutate(outlier = abs(rstandard(m1))>2.5)
head(office_ratings, n = 3)

Model Diagnostics: Outliers

  • How many data points are outliers?
office_ratings %>% count(outlier)
  • There are 6 outliers (defined as |standardized residual| > 2.5).

Model Diagnostics: Leverage and Influential Points

  • A leverage point is defined as follows:

    • A point for which x_i is far from the other values.
  • An influential point is defined as follows:

    • A point that significantly affects the regression model.
  • We check these together using Cook’s distance.

    • We will look for “spikes” in the plot.
  • We use the gg_cooksd() function from the lindia package to construct the graph.

    • Note that it is built with ggplot2 – we can modify the graph like we normally do with ggplot().
library(lindia)
gg_cooksd(m, show.threshold = FALSE) + theme_bw()

Model Diagnostics: Leverage and Influential Points

  • Let’s assess leverage and influence in the office ratings dataset.
library(lindia)
gg_cooksd(m1, show.threshold = FALSE)

Model Diagnostics: Multicollinearity

  • Collinearity/multicollinearity: strong correlation between two or more predictor variables that interferes with the estimation procedure.

  • We will use the variance inflation factor (VIF) to check for multicollinearity. \text{VIF}_j = \frac{1}{1-R^2_j},

  • where

    • j = the predictor of interest and j \in \{1, 2, ..., k \},
    • R^2_j results from regressing x_j on the remaining (k-1) predictors.
  • We say that multicollinearity is present if VIF > 10.

  • How do we deal with multicollinearity?

    • Easy answer: remove at least one predictor from the collinear set, then reassess VIF.

    • More complicated answer: discuss which predictor(s) to remove with collaborators/bosses.
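The VIF definition above can be computed by hand. A sketch with mtcars (for illustration): to get the VIF for wt in a model whose predictors are wt, hp, and disp, regress wt on the other two predictors and apply 1 / (1 - R^2_j).

```r
# VIF for wt by hand (mtcars used for illustration)
# Auxiliary regression: x_j on the remaining predictors
aux  <- lm(wt ~ hp + disp, data = mtcars)
r2_j <- summary(aux)$r.squared

vif_wt <- 1 / (1 - r2_j)
vif_wt  # > 10 would indicate multicollinearity
```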

Model Diagnostics: Multicollinearity

  • We will use the vif() function from the car package.
library(car)
vif(m)
  • Note: the car package masks some functions from the tidyverse, so I typically do not load the full library.

  • There will be a VIF value for each predictor in the model.

Model Diagnostics: Multicollinearity

  • Let’s check the multicollinearity for our data,
car::vif(m1)
  season  episode 
1.017385 1.017385 
  • Both VIFs are near 1, so no multicollinearity is present.

Sensitivity Analysis

  • We can perform sensitivity analysis to determine how much our model changes when we exclude the outliers.

    • Model 1: model using data with all observations
    • Model 2: model using data without identified outliers
  • Questions we will ask:

    • How different are the \hat{\beta}_i?
    • Did a predictor go from being significant to non-significant? (or vice-versa?)
    • Does the direction of \hat{\beta}_i change?
    • What is the difference in R^2?
  • We only look at sensitivity analysis once (i.e., only remove data points once for reanalysis).

    • If we keep going, we will whittle down the dataset to as close to a “perfect fit” as possible, introducing other issues.
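The questions above can be answered by fitting both models and comparing their output side by side. A sketch with mtcars (for illustration only, not today's dataset):

```r
# Sensitivity analysis sketch (mtcars used for illustration):
# Model 1 uses all observations; Model 2 drops flagged outliers.
m_all <- lm(mpg ~ wt + hp, data = mtcars)
keep  <- abs(rstandard(m_all)) <= 2.5
m_sub <- lm(mpg ~ wt + hp, data = mtcars[keep, ])

# How different are the beta-hats? Did any sign flip?
round(cbind(all = coef(m_all), no_outliers = coef(m_sub)), 4)

# What is the difference in R^2?
c(all = summary(m_all)$r.squared, no_outliers = summary(m_sub)$r.squared)
```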

Sensitivity Analysis

  • Let’s perform sensitivity analysis on our example model.
office_ratings_outliers <- office_ratings %>% filter(outlier == TRUE)
head(office_ratings_outliers, n=6)
m1 <- lm(imdb_rating ~ season + episode, data = office_ratings)
summary(m1)

Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.43672 -0.29274 -0.03135  0.26734  1.62060 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.605835   0.102452  83.999  < 2e-16 ***
season      -0.093289   0.015109  -6.174 4.11e-09 ***
episode      0.013616   0.005132   2.653  0.00868 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4888 on 185 degrees of freedom
Multiple R-squared:  0.1835,    Adjusted R-squared:  0.1747 
F-statistic: 20.79 on 2 and 185 DF,  p-value: 7.147e-09
office_ratings_no_outliers <- office_ratings %>% filter(outlier == FALSE)
m2 <- lm(imdb_rating ~ season + episode, data = office_ratings_no_outliers)
summary(m2)

Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings_no_outliers)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.99158 -0.28643 -0.03396  0.27160  1.19157 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.665452   0.090148  96.125  < 2e-16 ***
season      -0.099590   0.013239  -7.523 2.51e-12 ***
episode      0.010129   0.004515   2.243   0.0261 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4228 on 179 degrees of freedom
Multiple R-squared:  0.2474,    Adjusted R-squared:  0.239 
F-statistic: 29.43 on 2 and 179 DF,  p-value: 8.923e-12

Wrap Up

  • Today we have covered

    • model assumptions

    • model diagnostics

    • sensitivity analysis

  • Today’s activity:

    • Check the model assumptions and diagnostics on the model constructed last week.